Abstract: New channel coding converse and achievability
bounds are derived for a single use of an arbitrary channel.
Both bounds are expressed using a quantity called the "smooth
0-divergence", which is a generalization of Rényi's divergence of
order 0. The bounds are also studied in the limit of large
block-lengths. In particular, they combine to give a general capacity
formula which is equivalent to the one derived by Verdú and
Han.

I. Introduction 

We consider the problem of transmitting information
through a channel. A channel consists of an input alphabet $\mathcal{X}$,
an output alphabet $\mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are each equipped with
a $\sigma$-algebra, and the channel law, which is a stochastic kernel
$P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$. We consider average error probabilities
throughout this paper (see footnote 1); thus an $(m, \epsilon)$-code consists of an
encoder $f\colon \{1, \ldots, m\} \to \mathcal{X}$, $i \mapsto x$, and a decoder
$g\colon \mathcal{Y} \to \{1, \ldots, m\}$, $y \mapsto \hat{i}$, such that the probability that
$\hat{i} \neq i$ is smaller than or equal to $\epsilon$, assuming that the message
is uniformly distributed. Our aim is to derive upper and lower
bounds on the largest $m$, given $\epsilon > 0$, such that an $(m, \epsilon)$-code
exists for a given channel.

Such bounds are different from those in Shannon's original 
work [1] in the sense that they are nonasymptotic and do 
not rely on any channel structure such as memorylessness or 
information stability. 

Previous works have demonstrated the advantages of such 
nonasymptotic bounds. They can lead to more general channel 
capacity formulas [2] as well as giving tight approximations 
to the maximal rate achievable for a desired error probability 
and a fixed block-length [3]. 

In this paper we prove a new converse bound and a new 
achievability bound. They are asymptotically tight in the sense 
that they combine to give a general capacity formula that is 
equivalent to [2, (1.4)]. We are mainly interested in proving 
simple bounds which offer theoretical intuitions into channel 
coding problems. It is not our main concern to derive bounds 
which outperform the existing ones in estimating the largest 
achievable rates in finite block-length scenarios. In fact, as 
will be seen in Section VI, the new achievability bound is less 
tight than the one in [3], though the differences are small. 

1 Note that Shannon's method of obtaining codes that have small maximum 
error probabilities from those that have small average error probabilities [1] 
can be applied to our codes. We shall not examine other such methods which 
might lead to tighter bounds for finite block-lengths. 



Both new bounds are expressed using a quantity which we
call the smooth 0-divergence, denoted $D_0^\delta(\cdot\|\cdot)$, where $\delta$ is a
positive parameter. This quantity is a generalization of Rényi's
divergence of order 0 [4]. Thus, our new bounds demonstrate
connections between the channel coding problem and Rényi's
divergence of order 0. Various previous works [5], [6], [7]
have shown connections between channel coding and Rényi's
information measures of order $\alpha$ for $\alpha \ge \frac{1}{2}$. Also relevant
is [8], where channel coding bounds were derived using the
smooth min- and max-entropies introduced in [9].

As will be seen, proofs of the new bounds are simple and 
self-contained. The achievability bound uses random coding 
and suboptimal decoding, where the decoding rule can be 
thought of as a generalization of Shannon's joint typicality 
decoding rule [1]. The converse is proved by simple algebra 
combined with the fact that $D_0^\delta(\cdot\|\cdot)$ satisfies a Data Processing
Theorem. 

The quantity $D_0^\delta(\cdot\|\cdot)$ has also been defined for quantum
systems [10], [11]. In [11] the present work is extended to 
quantum communication channels. 

The remainder of this paper is arranged as follows: in
Section II we introduce the quantity $D_0^\delta(\cdot\|\cdot)$; in Section III we
state and prove the converse theorem; in Section IV we state
and prove the achievability theorem; in Section V we analyze
the bounds asymptotically for an arbitrary channel to study
its capacity and $\epsilon$-capacity; finally, in Section VI we compare
numerical results obtained using our new achievability bound
with some existing bounds.



II. The Quantity $D_0^\delta(\cdot\|\cdot)$

In [4] Rényi defined entropies and divergences of order
$\alpha$ for every $\alpha > 0$. We denote these $H_\alpha(\cdot)$ and $D_\alpha(\cdot\|\cdot)$,
respectively. They are generalizations of Shannon's entropy
$H(\cdot)$ and relative entropy $D(\cdot\|\cdot)$.

Letting $\alpha$ tend to zero in $D_\alpha(\cdot\|\cdot)$ yields the following
definition of $D_0(\cdot\|\cdot)$.

Definition 1 (Rényi's Divergence of Order 0): For $P$ and
$Q$ which are two probability measures on $(\Omega, \mathcal{F})$, $D_0(P\|Q)$
is defined as

$$D_0(P\|Q) = -\log \int_{\mathrm{supp}(P)} dQ, \tag{1}$$

where we use the convention $\log 0 = -\infty$ (see footnote 2).

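We generalize $D_0(\cdot\|\cdot)$ to define $D_0^\delta(\cdot\|\cdot)$ as follows.

Definition 2 (Smooth 0-Divergence): Let $P$ and $Q$ be two
probability measures on $(\Omega, \mathcal{F})$. For $\delta \ge 0$, $D_0^\delta(P\|Q)$ is
defined as

$$D_0^\delta(P\|Q) = \sup_{\substack{\Phi\colon \Omega \to [0,1] \\ \int_\Omega \Phi \, dP \ge 1-\delta}} \left\{ -\log \int_\Omega \Phi \, dQ \right\}. \tag{2}$$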



Remark: To achieve the supremum in (2), one should choose
$\Phi$ to be large (equal to 1) for large $\frac{dP}{dQ}$ and vice versa.
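For finite alphabets the supremum in (2) is a small linear program, and the remark suggests a greedy solution: let $\Phi$ equal 1 on the outcomes with the largest likelihood ratio $P(x)/Q(x)$ until the constraint $\int \Phi \, dP \ge 1-\delta$ is met, using a fractional value on the last outcome if needed. The Python sketch below illustrates this idea. It is not part of the original derivation: the function name `smooth_d0` is hypothetical, and logarithms are taken to base 2, an assumption chosen to match the factor $2^{-D}$ that appears in the achievability proof.

```python
import math

def smooth_d0(p, q, delta):
    """Smooth 0-divergence D_0^delta(P||Q) in bits for finite distributions.

    p and q are sequences of probabilities over the same alphabet.
    Greedy rule: spend the required P-mass (1 - delta) on outcomes with
    the smallest ratio Q(x)/P(x), i.e. the largest likelihood ratio.
    """
    need = 1.0 - delta  # P-mass that Phi still has to cover
    # Outcomes with P(x) = 0 never help the constraint, so Phi = 0 there.
    items = sorted(((pi, qi) for pi, qi in zip(p, q) if pi > 0),
                   key=lambda t: t[1] / t[0])
    q_mass = 0.0  # running value of the integral of Phi under Q
    for pi, qi in items:
        if need <= 0.0:
            break
        frac = min(1.0, need / pi)  # value of Phi on this outcome
        q_mass += frac * qi
        need -= frac * pi
    return math.inf if q_mass <= 0.0 else -math.log2(q_mass)

if __name__ == "__main__":
    # delta = 0 recovers D_0(P||Q) = -log Q(supp P): here -log2(0.5)
    print(smooth_d0([1.0, 0.0], [0.5, 0.5], 0.0))      # 1.0
    # Allowing delta = 0.5 lets Phi drop the less likely outcome
    print(smooth_d0([0.5, 0.5], [0.25, 0.75], 0.5))    # 2.0
```

The greedy order is the discrete counterpart of choosing $\Phi$ large where $\frac{dP}{dQ}$ is large; fractional values of $\Phi$ are only needed at the threshold outcome.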



Lemma 1 (Properties of $D_0^\delta(\cdot\|\cdot)$):

1) $D_0^\delta(P\|Q)$ is monotonically nondecreasing in $\delta$.

2) When $\delta = 0$, the supremum in (2) is achieved by
choosing $\Phi$ to be 1 on $\mathrm{supp}(P)$ and to be 0 elsewhere,
which yields $D_0^0(P\|Q) = D_0(P\|Q)$.

3) If $P$ has no point masses, then the supremum in (2) is
achieved by letting $\Phi$ take value in $\{0, 1\}$ only and

$$D_0^\delta(P\|Q) = \sup_{P'\colon \frac{1}{2}\|P'-P\|_1 \le \delta} D_0(P'\|Q).$$

4) (Data Processing Theorem) Let $P$ and $Q$ be probability
measures on $(\Omega, \mathcal{F})$, and let $W$ be a stochastic kernel
from $(\Omega, \mathcal{F})$ to $(\Omega', \mathcal{F}')$. For all $\delta \ge 0$, we have

$$D_0^\delta(P\|Q) \ge D_0^\delta(W \circ P\|W \circ Q), \tag{3}$$

where $W \circ P$ denotes the probability distribution on
$(\Omega', \mathcal{F}')$ induced by $P$ and $W$, and similarly for $W \circ Q$.
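Proof: The first three properties are immediate consequences
of the definition and the remark. We therefore only
prove 4).

For any $\Phi'\colon \Omega' \to [0, 1]$ such that

$$\int_{\Omega'} \Phi' \, d(W \circ P) \ge 1-\delta,$$

we choose $\Phi\colon \Omega \to \mathbb{R}$ to be

$$\Phi(\omega) = \int_{\Omega'} \Phi'(\omega') \, W(d\omega'|\omega), \quad \omega \in \Omega.$$

Then we have that $\Phi(\omega) \in [0, 1]$ for all $\omega \in \Omega$. Further,

$$\int_\Omega \Phi \, dP = \int_{\Omega'} \Phi' \, d(W \circ P) \ge 1-\delta,$$

$$\int_\Omega \Phi \, dQ = \int_{\Omega'} \Phi' \, d(W \circ Q).$$

Thus we have

$$\sup_{\substack{\Phi\colon \Omega \to [0,1] \\ \int_\Omega \Phi \, dP \ge 1-\delta}} \left\{ -\log \int_\Omega \Phi \, dQ \right\} \ge \sup_{\substack{\Phi'\colon \Omega' \to [0,1] \\ \int_{\Omega'} \Phi' \, d(W \circ P) \ge 1-\delta}} \left\{ -\log \int_{\Omega'} \Phi' \, d(W \circ Q) \right\},$$

which proves 4). ■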
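The Data Processing Theorem can be sanity-checked numerically on finite alphabets. The sketch below draws a random pair of distributions and a random stochastic kernel and compares the smooth 0-divergence before and after applying the kernel. It is only an illustration under assumptions: the helper names (`smooth_d0`, `push_forward`, `random_dist`) are hypothetical, the greedy evaluation of (2) is a standard linear-programming argument rather than something stated in the text, and logarithms are to base 2.

```python
import math
import random

def smooth_d0(p, q, delta):
    """D_0^delta(P||Q) in bits for finite distributions (greedy solution of (2))."""
    need = 1.0 - delta
    items = sorted(((pi, qi) for pi, qi in zip(p, q) if pi > 0),
                   key=lambda t: t[1] / t[0])
    q_mass = 0.0
    for pi, qi in items:
        if need <= 0.0:
            break
        frac = min(1.0, need / pi)
        q_mass += frac * qi
        need -= frac * pi
    return math.inf if q_mass <= 0.0 else -math.log2(q_mass)

def push_forward(dist, kernel):
    """Distribution W o dist, where kernel[x][y] plays the role of W(y | x)."""
    n_out = len(kernel[0])
    return [sum(dist[x] * kernel[x][y] for x in range(len(dist)))
            for y in range(n_out)]

def random_dist(rng, size):
    """A random probability vector of the given size."""
    weights = [rng.random() for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    rng = random.Random(1)
    p, q = random_dist(rng, 5), random_dist(rng, 5)
    kernel = [random_dist(rng, 3) for _ in range(5)]  # 5 inputs, 3 outputs
    for delta in (0.0, 0.05, 0.2):
        before = smooth_d0(p, q, delta)
        after = smooth_d0(push_forward(p, kernel), push_forward(q, kernel), delta)
        # By (3), 'before' should not be smaller than 'after' (up to rounding).
        print(delta, before, after)
```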

2 We remark that for distributions defined on a finite alphabet $\mathcal{X}$, the
equivalent of (1) is $D_0(P\|Q) = -\log \sum_{x\colon P(x)>0} Q(x)$.



A relation between $D_0^\delta(P\|Q)$, $D(P\|Q)$ and the information
spectrum methods [12], [13] can be seen in the next
lemma. A slightly different quantum version of this theorem
has been proven in [10]. We include a classical proof of it in
the Appendix.

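Lemma 2: Let $P_n$ and $Q_n$ be probability measures on
$(\Omega_n, \mathcal{F}_n)$ for every $n \in \mathbb{N}$. Then

$$\lim_{\delta \downarrow 0} \liminf_{n \to \infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) = \{P_n\}\text{-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}. \tag{4}$$

Here $\{P_n\}\text{-}\liminf$ means the liminf in probability with respect
to the sequence of probability measures $\{P_n\}$, that is, for a
real stochastic process $\{Z_n\}$,

$$\{P_n\}\text{-}\liminf_{n \to \infty} Z_n = \sup \left\{ a \in \mathbb{R}\colon \lim_{n \to \infty} P_n(\{Z_n < a\}) = 0 \right\}.$$

In particular, let $P^{\times n}$ and $Q^{\times n}$ denote the product distributions
of $P$ and $Q$, respectively, on $(\Omega^{\otimes n}, \mathcal{F}^{\otimes n})$; then

$$\lim_{\delta \downarrow 0} \lim_{n \to \infty} \frac{1}{n} D_0^\delta(P^{\times n}\|Q^{\times n}) = D(P\|Q). \tag{5}$$

Proof: See Appendix. ■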
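The product-distribution statement (5) can be observed numerically for Bernoulli sources. For $P = \mathrm{Bernoulli}(p)$ and $Q = \mathrm{Bernoulli}(q)$ the likelihood ratio of a sequence depends only on its number of ones, so the supremum in (2) can be evaluated over the $n+1$ type classes. The sketch below is an illustration only: the function names and the parameter values $p = 0.2$, $q = 0.6$, $\delta = 0.05$ are assumptions, logarithms are to base 2, and floating-point underflow limits it to moderate $n$.

```python
import math

def ln_binom_pmf(n, k, p):
    """Natural log of the Binomial(n, p) probability of exactly k ones."""
    ln_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)

def product_smooth_d0(n, p, q, delta):
    """D_0^delta(P^{xn} || Q^{xn}) in bits for P = Bernoulli(p), Q = Bernoulli(q)."""
    classes = []
    for k in range(n + 1):
        ln_p, ln_q = ln_binom_pmf(n, k, p), ln_binom_pmf(n, k, q)
        # key = log(Q/P) of the class; math.exp underflows silently to 0.0
        classes.append((ln_q - ln_p, math.exp(ln_p), math.exp(ln_q)))
    classes.sort()  # smallest Q/P first, i.e. largest likelihood ratio first
    need, q_mass = 1.0 - delta, 0.0
    for _, p_k, q_k in classes:
        if need <= 0.0:
            break
        if p_k == 0.0:
            continue  # class carries no representable P-mass
        frac = min(1.0, need / p_k)
        q_mass += frac * q_k
        need -= frac * p_k
    return math.inf if q_mass == 0.0 else -math.log2(q_mass)

def kl_bernoulli(p, q):
    """Relative entropy D(P||Q) in bits between two Bernoulli distributions."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

if __name__ == "__main__":
    p, q, delta = 0.2, 0.6, 0.05
    print("D(P||Q) =", kl_bernoulli(p, q))
    for n in (10, 100, 1000):
        print(n, product_smooth_d0(n, p, q, delta) / n)
```

For a fixed $\delta$ the normalized values approach $D(P\|Q)$ slowly as $n$ grows, which is consistent with the order of limits in (5).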

III. The Converse 
We first state and prove a lemma. 

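Lemma 3: Let $M$ be uniformly distributed over $\{1, \ldots, m\}$
and let $\hat{M}$ also take value in $\{1, \ldots, m\}$. If the probability that
$\hat{M} \neq M$ is at most $\epsilon$, then

$$\log m \le D_0^\epsilon(P_{M\hat{M}}\|P_M \times P_{\hat{M}}),$$

where $P_{M\hat{M}}$ denotes the joint distribution of $M$ and $\hat{M}$ while
$P_M$ and $P_{\hat{M}}$ denote its marginals.

Proof: Let $\Phi$ be the indicator of the event $M = \hat{M}$, i.e.,

$$\Phi(i, \hat{i}) = \begin{cases} 1, & i = \hat{i}, \\ 0, & \text{otherwise,} \end{cases} \qquad i, \hat{i} \in \{1, \ldots, m\}.$$

Because, by assumption, the probability that $\hat{M} \neq M$ is not
larger than $\epsilon$, we have

$$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi \, dP_{M\hat{M}} \ge 1-\epsilon.$$

Thus, to prove the lemma, it suffices to show that

$$\log m \le -\log \int_{\{1,\ldots,m\}^{\otimes 2}} \Phi \, d(P_M \times P_{\hat{M}}). \tag{6}$$

To justify this we write:

$$\int_{\{1,\ldots,m\}^{\otimes 2}} \Phi \, d(P_M \times P_{\hat{M}}) = \sum_{i=1}^{m} P_M(i) \cdot P_{\hat{M}}(i) = \frac{1}{m} \sum_{i=1}^{m} P_{\hat{M}}(i) = \frac{1}{m},$$

from which it follows that (6) is satisfied with equality. ■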



Theorem 1 (Converse): An $(m, \epsilon)$-code satisfies

$$\log m \le \sup_{P_X} D_0^\epsilon(P_{XY}\|P_X \times P_Y), \tag{7}$$

where $P_{XY}$ and $P_Y$ are probability distributions on $\mathcal{X} \times \mathcal{Y}$
and $\mathcal{Y}$, respectively, induced by $P_X$ and the channel law.

Proof: Choose $P_X$ to be the distribution induced by the
message uniformly distributed over $\{1, \ldots, m\}$; then

$$\log m \le D_0^\epsilon(P_{M\hat{M}}\|P_M \times P_{\hat{M}}) \le D_0^\epsilon(P_{XY}\|P_X \times P_Y),$$

where the first inequality follows by Lemma 3, and the second
inequality by Lemma 1 Part 4) and the fact that
$M \to X \to Y \to \hat{M}$ forms a Markov chain. Theorem 1
follows. ■

IV. Achievability

Theorem 2 (Achievability): For any channel, any $\epsilon > 0$ and
$\epsilon' \in [0, \epsilon)$ there exists an $(m, \epsilon)$-code satisfying

$$\log m \ge \sup_{P_X} D_0^{\epsilon'}(P_{XY}\|P_X \times P_Y) - \log \frac{1}{\epsilon - \epsilon'}, \tag{8}$$

where $P_{XY}$ and $P_Y$ are induced by $P_X$ and the channel law.

The proof of Theorem 2 can be thought of as a generalization
of Shannon's original achievability proof [1]. We use
random coding as in [1]; for the decoder, we generalize Shannon's
typicality decoder to allow, instead of the "indicator"
for the jointly typical set, an arbitrary function on input-output
pairs.

Proof: For any distribution $P_X$ on $\mathcal{X}$ and any $m \in \mathbb{Z}^+$,
we randomly generate a codebook of size $m$ such that the
$m$ codewords are independent and identically distributed according
to $P_X$. We shall show that, for any $\epsilon'$, there exists
a decoding rule associated with each codebook such that the
average probability of a decoding error averaged over all such
codebooks satisfies

$$\Pr(\text{error}) \le (m - 1) \cdot 2^{-D_0^{\epsilon'}(P_{XY}\|P_X \times P_Y)} + \epsilon'. \tag{9}$$

Then there exists at least one codebook whose average probability
of error is upper-bounded by the right hand side (RHS)
of (9). That this codebook satisfies (8) follows by rearranging
terms in (9).
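The rearrangement can be spelled out as follows. This is a sketch of a step the text leaves implicit; $D$ is shorthand introduced here (not the paper's notation) for $D_0^{\epsilon'}(P_{XY}\|P_X \times P_Y)$, logarithms are assumed to be to base 2, and $D$ is assumed finite.

```latex
% Require the RHS of (9) to be at most \epsilon:
(m-1)\cdot 2^{-D} + \epsilon' \le \epsilon
\quad\Longleftrightarrow\quad
m - 1 \le (\epsilon - \epsilon')\, 2^{D}.

% The largest integer m allowed by this condition is
m = \left\lfloor (\epsilon - \epsilon')\, 2^{D} \right\rfloor + 1
\;\ge\; (\epsilon - \epsilon')\, 2^{D},

% and taking logarithms gives
\log m \;\ge\; D - \log \frac{1}{\epsilon - \epsilon'} .
```

Since such a code exists for every choice of $P_X$, the largest $m$ for which an $(m, \epsilon)$-code exists is at least the supremum over $P_X$ of the right-hand side, which is the form stated in (8).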

We shall next prove (9). For a given codebook and any
$\Phi\colon \mathcal{X} \times \mathcal{Y} \to [0, 1]$ which satisfies

$$\int_{\mathcal{X} \times \mathcal{Y}} \Phi \, dP_{XY} \ge 1-\epsilon', \tag{10}$$

we use the following random decoding rule (see footnote 3): when $y$ is
received, select some or none of the messages such that message
$j$ is selected with probability $\Phi(f(j), y)$ independently
of the other messages. If only one message is selected, output
this message; otherwise declare an error.

3 It is well-known that, for the channel model considered in this paper, the 
average probability of error cannot be improved by allowing random decoding 
rules. 



To analyze the error probability, suppose $i$ was the transmitted
message. The error event is the union of $\mathcal{E}_1$ and $\mathcal{E}_2$,
where $\mathcal{E}_1$ denotes the event that some message other than $i$ is
selected; $\mathcal{E}_2$ denotes the event that message $i$ is not selected.

We first bound $\Pr(\mathcal{E}_1)$ averaged over all codebooks. Fix
$f(i)$ and $y$. The probability averaged over all codebooks of
selecting a particular message other than $i$ is given by

$$\int_{\mathcal{X}} \Phi(x, y) \, P_X(dx).$$

Since there are $(m - 1)$ such messages, we can use the union
bound to obtain

$$\mathrm{E}[\Pr(\mathcal{E}_1 | f(i), y)] \le (m - 1) \cdot \int_{\mathcal{X}} \Phi(x, y) \, P_X(dx). \tag{11}$$

Since the RHS of (11) does not depend on $f(i)$, we further
have

$$\mathrm{E}[\Pr(\mathcal{E}_1 | y)] \le (m - 1) \cdot \int_{\mathcal{X}} \Phi(x, y) \, P_X(dx).$$

Averaging this inequality over $y$ gives

$$\mathrm{E}[\Pr(\mathcal{E}_1)] \le (m - 1) \int_{\mathcal{Y}} \left( \int_{\mathcal{X}} \Phi(x, y) \, P_X(dx) \right) P_Y(dy) = (m - 1) \int_{\mathcal{X} \times \mathcal{Y}} \Phi \, d(P_X \times P_Y). \tag{12}$$

On the other hand, the probability of $\mathcal{E}_2$ averaged over all
generated codebooks can be bounded as

$$\mathrm{E}[\Pr(\mathcal{E}_2)] = \int_{\mathcal{X} \times \mathcal{Y}} (1 - \Phi) \, dP_{XY} \le \epsilon'. \tag{13}$$

Combining (12) and (13) yields

$$\Pr(\text{error}) \le (m - 1) \int_{\mathcal{X} \times \mathcal{Y}} \Phi \, d(P_X \times P_Y) + \epsilon'. \tag{14}$$

Finally, since (14) holds for every $\Phi$ satisfying (10), we
establish (9) and thus conclude the proof of Theorem 2. ■
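The random-coding argument can be checked empirically on a toy example. The Monte Carlo sketch below is illustrative only and rests on several assumptions that are not in the text: a BSC with crossover probability 0.11 at block length 15, $m = 8$ messages, a uniform $P_X$, and the specific choice $\Phi(x, y) = 1$ when the Hamming distance between $x$ and $y$ is at most $t = 4$ and $0$ otherwise (so the "random" selection becomes deterministic). All function names are hypothetical.

```python
import math
import random

def binom_cdf(n, t, p):
    """P(Binomial(n, p) <= t), computed exactly from the pmf."""
    return sum(math.comb(n, d) * p**d * (1 - p)**(n - d) for d in range(t + 1))

def bound_14(n, m, p, t):
    """Right-hand side of (14) for Phi = indicator{Hamming distance <= t}."""
    q_mass = binom_cdf(n, t, 0.5)         # integral of Phi under P_X x P_Y
    eps_prime = 1.0 - binom_cdf(n, t, p)  # integral of (1 - Phi) under P_XY
    return (m - 1) * q_mass + eps_prime

def simulate(n=15, m=8, p=0.11, t=4, trials=2000, seed=0):
    """Estimate the error probability averaged over random codebooks."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        # A fresh i.i.d. uniform codebook per trial averages over codebooks.
        book = [rng.getrandbits(n) for _ in range(m)]
        i = rng.randrange(m)  # uniformly distributed message
        noise = sum(1 << b for b in range(n) if rng.random() < p)
        y = book[i] ^ noise   # BSC output for codeword f(i)
        # Message j is selected iff Phi(f(j), y) = 1.
        selected = [j for j in range(m)
                    if bin(book[j] ^ y).count("1") <= t]
        # Success only if exactly the transmitted message is selected.
        if selected != [i]:
            errors += 1
    return errors / trials

if __name__ == "__main__":
    print("empirical error:", simulate())
    print("bound (14):     ", bound_14(15, 8, 0.11, 4))
```

The empirical error should fall below the bound, with the gap coming mainly from the union bound used in (11).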

V. Asymptotic Analysis 

In this section we use the new bounds to study the capacity
of a channel whose structure can be arbitrary. Such a channel
is described by stochastic kernels from $\mathcal{X}^n$ to $\mathcal{Y}^n$ for all
$n \in \mathbb{Z}^+$, where $\mathcal{X}$ and $\mathcal{Y}$ are the input and output alphabets,
respectively. An $(n, M, \epsilon)$-code on a channel consists of an
encoder and a decoder such that a message of size $M$ can
be transmitted by mapping it to an element of $\mathcal{X}^n$ while the
probability of error is no larger than $\epsilon$. The capacity and the
optimistic capacity [14] of a channel are defined as follows.

Definition 3 (Capacity and Optimistic Capacity): The capacity
$C$ of a channel is the supremum over all $R$ for which
there exists a sequence of $(n, M_n, \epsilon_n)$-codes such that

$$\frac{\log M_n}{n} \ge R, \quad n \in \mathbb{Z}^+ \tag{15}$$

and

$$\lim_{n \to \infty} \epsilon_n = 0.$$

The optimistic capacity $\bar{C}$ of a channel is the supremum over
all $R$ for which there exists a sequence of $(n, M_n, \epsilon_n)$-codes
such that (15) holds and

$$\liminf_{n \to \infty} \epsilon_n = 0.$$

Given Definition 3, the next theorem is an immediate 
consequence of Theorems 1 and 2. 

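Theorem 3 (Capacity Formulas): Any channel satisfies

$$C = \lim_{\epsilon \downarrow 0} \liminf_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n}), \tag{16}$$

$$\bar{C} = \lim_{\epsilon \downarrow 0} \limsup_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n}). \tag{17}$$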

Remark: According to Lemma 2, (16) is equivalent to [2, 
(1.4)]. It can also be shown that (17) is equivalent to [15, 
Theorem 4.4]. 

We can also use Theorems 1 and 2 to study the $\epsilon$-capacities,
which are usually defined as follows (see, for example, [2],
[15]).

Definition 4 ($\epsilon$-Capacity and Optimistic $\epsilon$-Capacity): The
$\epsilon$-capacity $C_\epsilon$ of a channel is the supremum over all $R$ such
that, for every large enough $n$, there exists an $(n, M_n, \epsilon)$-code
satisfying

$$\frac{\log M_n}{n} \ge R.$$

The optimistic $\epsilon$-capacity $\bar{C}_\epsilon$ of a channel is the supremum
over all $R$ for which there exist $(n, M_n, \epsilon)$-codes for infinitely
many $n$ satisfying

$$\frac{\log M_n}{n} \ge R.$$

The following bounds on the $\epsilon$-capacity and optimistic
$\epsilon$-capacity of a channel are immediate consequences of Theorems
1 and 2. They can be shown to be equivalent to those
in [2, Theorem 6], [16, Theorem 7] and [15, Theorem 4.3].
As in those previous results, the bounds for $C_\epsilon$ ($\bar{C}_\epsilon$) coincide
except possibly at the points of discontinuity of $C_\epsilon$ ($\bar{C}_\epsilon$).

Theorem 4 (Bounds on $\epsilon$-Capacities): For any channel and
any $\epsilon \in (0, 1)$, the $\epsilon$-capacity of the channel satisfies

$$C_\epsilon \le \liminf_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n}),$$

$$C_\epsilon \ge \lim_{\epsilon' \uparrow \epsilon} \liminf_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n});$$

and the optimistic $\epsilon$-capacity of the channel satisfies

$$\bar{C}_\epsilon \le \limsup_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^\epsilon(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n}),$$

$$\bar{C}_\epsilon \ge \lim_{\epsilon' \uparrow \epsilon} \limsup_{n \to \infty} \frac{1}{n} \sup_{P_{X^n}} D_0^{\epsilon'}(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n}).$$

VI. Numerical Comparison with Existing Bounds for the BSC

In this section we compare the new achievability bound
obtained in this paper with the bounds by Gallager [5] and
Polyanskiy et al. [3]. We consider the memoryless binary
symmetric channel (BSC) with crossover probability 0.11.
Thus, for $n$ channel uses, the input and output alphabets are
both $\{0, 1\}^n$ and the channel law is given by

$$P_{Y^n|X^n}(y^n|x^n) = 0.11^{|y^n - x^n|} \, 0.89^{n - |y^n - x^n|},$$

where $|\cdot|$ denotes the Hamming weight of a binary vector.
The average block-error rate is chosen to be $10^{-3}$.

In the calculations of all three achievability bounds we
choose $P_{X^n}$ to be uniform on $\{0, 1\}^n$. For comparison we
include the plot of the converse used in [3]. Our new converse
bound involves optimization over input distributions and is
thus difficult to compute. In fact, in this example it is less
tight compared to the one in [3] since for the uniform input
distribution $D_0^{0.001}(P_{X^n Y^n}\|P_{X^n} \times P_{Y^n})$ coincides with the
latter.
Comparison of the curves is shown in Figure 1. For the
example we consider, the new achievability bound is always less
tight than the one in [3], though the difference is small. It
outperforms Gallager's bound for large block-lengths.

Fig. 1. Comparison of the new achievability bound with Gallager [5] and
Polyanskiy et al. [3] for the BSC with crossover probability 0.11 and average
block-error rate $10^{-3}$. The converse is the one used in [3].

Appendix 

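In this appendix we prove Lemma 2. We first show that

$$\lim_{\delta \downarrow 0} \liminf_{n \to \infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \ge \{P_n\}\text{-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}. \tag{18}$$

To this end, consider any $a$ satisfying

$$0 < a < \{P_n\}\text{-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}. \tag{19}$$

Let $A_n(a) \in \mathcal{F}_n$, $n \in \mathbb{N}$, be the union of all measurable sets
on which

$$\frac{1}{n} \log \frac{dP_n}{dQ_n} \ge a. \tag{20}$$

Let $\Phi_n\colon \Omega_n \to [0, 1]$, $n \in \mathbb{N}$, equal 1 on $A_n(a)$ and equal 0
elsewhere; then by (19) we have

$$\lim_{n \to \infty} \int_{\Omega_n} \Phi_n \, dP_n = \lim_{n \to \infty} P_n(A_n(a)) = 1. \tag{21}$$

Thus we have

$$\begin{aligned}
\lim_{\delta \downarrow 0} \liminf_{n \to \infty} \frac{1}{n} D_0^\delta(P_n\|Q_n)
&\ge \liminf_{n \to \infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n \, dQ_n \right) \\
&\ge \liminf_{n \to \infty} \left( -\frac{1}{n} \log \left( \int_{\Omega_n} \Phi_n \, dP_n \cdot 2^{-na} \right) \right) \\
&= \liminf_{n \to \infty} \left( -\frac{1}{n} \log \left( 2^{-na} \right) \right) \\
&= a,
\end{aligned} \tag{22}$$

where the first inequality follows because, according to (21),
for any $\delta > 0$, $\int_{\Omega_n} \Phi_n \, dP_n = P_n(A_n(a)) \ge 1 - \delta$ for large
enough $n$; the second inequality by (20) and the fact that $\Phi_n$
is zero outside $A_n(a)$; the next equality by (21). Since (22)
holds for every $a$ satisfying (19), we obtain (18).

We next show the other direction, namely, we show that

$$\lim_{\delta \downarrow 0} \liminf_{n \to \infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \le \{P_n\}\text{-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}. \tag{23}$$

To this end, consider any

$$b > \{P_n\}\text{-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{dP_n}{dQ_n}. \tag{24}$$

Let $A'_n(b)$, $n \in \mathbb{N}$, be the union of all measurable sets on
which

$$\frac{1}{n} \log \frac{dP_n}{dQ_n} \le b. \tag{25}$$

By (24) we have that there exists some $c \in (0, 1]$ such that

$$\limsup_{n \to \infty} P_n(A'_n(b)) = c. \tag{26}$$

For every $\delta \in (0, c)$, consider any sequence of $\Phi_n\colon \Omega_n \to [0, 1]$
satisfying

$$\int_{\Omega_n} \Phi_n \, dP_n \ge 1 - \delta, \quad n \in \mathbb{N}. \tag{27}$$

Combining (26) and (27) yields

$$\limsup_{n \to \infty} \int_{A'_n(b)} \Phi_n \, dP_n \ge c - \delta. \tag{28}$$

On the other hand, from (25) it follows that

$$\int_{\Omega_n} \Phi_n \, dQ_n \ge \int_{A'_n(b)} \Phi_n \, dP_n \cdot 2^{-nb}. \tag{29}$$

Combining (28) and (29) yields

$$\liminf_{n \to \infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n \, dQ_n \right) \le \liminf_{n \to \infty} \left( -\frac{1}{n} \log \left( \int_{A'_n(b)} \Phi_n \, dP_n \cdot 2^{-nb} \right) \right) \le b.$$

Thus we obtain that for every $\delta \in (0, c)$ and every sequence
$\Phi_n\colon \Omega_n \to [0, 1]$ satisfying (27),

$$\liminf_{n \to \infty} \left( -\frac{1}{n} \log \int_{\Omega_n} \Phi_n \, dQ_n \right) \le b.$$

This implies that, for every $\delta \in (0, c)$,

$$\liminf_{n \to \infty} \frac{1}{n} D_0^\delta(P_n\|Q_n) \le b. \tag{30}$$

Inequality (30) still holds when we take the limit $\delta \downarrow 0$. Since
this is true for every $b$ satisfying (24), we establish (23).
Combining (18) and (23) proves (4).

Finally, (5) follows from (4) because, by the law of large
numbers,

$$\frac{1}{n} \log \frac{d(P^{\times n})}{d(Q^{\times n})} \to \mathrm{E}\left[ \log \frac{dP}{dQ} \right] = D(P\|Q) \tag{31}$$

as $n \to \infty$, $P^{\times n}$-almost surely. This completes the proof of
Lemma 2.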